Medical Dataset - Segmenting Patients¶
InĀ [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
InĀ [2]:
df=pd.read_csv('patient_dataset.csv', index_col=0)
df.head()
Out[2]:
| age | gender | chest pain type | blood pressure | cholesterol | max heart rate | exercise angina | plasma glucose | skin_thickness | insulin | bmi | diabetes_pedigree | hypertension | heart_disease | Residence_type | smoking_status | triage | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40.0 | 1.0 | 2.0 | 140.0 | 294.0 | 172.0 | 0.0 | 108.0 | 43.0 | 92.0 | 19.0 | 0.467386 | 0.0 | 0.0 | Urban | never smoked | yellow |
| 1 | 49.0 | 0.0 | 3.0 | 160.0 | 180.0 | 156.0 | 0.0 | 75.0 | 47.0 | 90.0 | 18.0 | 0.467386 | 0.0 | 0.0 | Urban | never smoked | orange |
| 2 | 37.0 | 1.0 | 2.0 | 130.0 | 294.0 | 156.0 | 0.0 | 98.0 | 53.0 | 102.0 | 23.0 | 0.467386 | 0.0 | 0.0 | Urban | never smoked | yellow |
| 3 | 48.0 | 0.0 | 4.0 | 138.0 | 214.0 | 156.0 | 1.0 | 72.0 | 51.0 | 118.0 | 18.0 | 0.467386 | 0.0 | 0.0 | Urban | never smoked | orange |
| 4 | 54.0 | 1.0 | 3.0 | 150.0 | 195.0 | 156.0 | 0.0 | 108.0 | 90.0 | 83.0 | 21.0 | 0.467386 | 0.0 | 0.0 | Urban | never smoked | yellow |
Defining Problem Statement and perform Exploratory Data Analysis¶
Triage refers to the sorting of injured or sick people according to their need for emergency medical attention. It is a method of determining priority for who gets care first.
As the dataset is seen
Based on patient symptoms,
Identify patients needing immediate resuscitation;
To assign patients to a predesignated patient care area,
Thereby prioritizing their care;
And to initiate diagnostic/therapeutic measures as appropriate
The dataset includes demographic, lifestyle, and health-related features, such as age, gender, cholesterol levels, blood pressure, BMI, diabetes history, and smoking status.
Apply unsupervised learning techniques such as K-Means, Gaussian Mixture Model and Hierarchical Clustering to segment the data into meaningful clusters.
The study will explore whether these clusters reveal distinct patient groups that could be useful for medical research, risk stratification, or personalized treatment plans.
Observations on the data types of all the attributes¶
InĀ [3]:
df.info()
<class 'pandas.core.frame.DataFrame'> Index: 6962 entries, 0 to 5109 Data columns (total 17 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 age 6962 non-null float64 1 gender 6961 non-null float64 2 chest pain type 6962 non-null float64 3 blood pressure 6962 non-null float64 4 cholesterol 6962 non-null float64 5 max heart rate 6962 non-null float64 6 exercise angina 6962 non-null float64 7 plasma glucose 6962 non-null float64 8 skin_thickness 6962 non-null float64 9 insulin 6962 non-null float64 10 bmi 6962 non-null float64 11 diabetes_pedigree 6962 non-null float64 12 hypertension 6962 non-null float64 13 heart_disease 6962 non-null float64 14 Residence_type 6962 non-null object 15 smoking_status 6962 non-null object 16 triage 6552 non-null object dtypes: float64(14), object(3) memory usage: 979.0+ KB
Missing value check¶
InĀ [4]:
print('Missing Values in the dataset ')
df.isna().sum()
Missing Values in the dataset
Out[4]:
age 0 gender 1 chest pain type 0 blood pressure 0 cholesterol 0 max heart rate 0 exercise angina 0 plasma glucose 0 skin_thickness 0 insulin 0 bmi 0 diabetes_pedigree 0 hypertension 0 heart_disease 0 Residence_type 0 smoking_status 0 triage 410 dtype: int64
InĀ [5]:
print("Total Missing Values ")
df.isna().sum().sum()
Total Missing Values
Out[5]:
411
Outlier detection¶
InĀ [6]:
# Extracting Numerical data from the pool
numeric_data=df.select_dtypes('number')
numeric_data.head(5)
Out[6]:
| age | gender | chest pain type | blood pressure | cholesterol | max heart rate | exercise angina | plasma glucose | skin_thickness | insulin | bmi | diabetes_pedigree | hypertension | heart_disease | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40.0 | 1.0 | 2.0 | 140.0 | 294.0 | 172.0 | 0.0 | 108.0 | 43.0 | 92.0 | 19.0 | 0.467386 | 0.0 | 0.0 |
| 1 | 49.0 | 0.0 | 3.0 | 160.0 | 180.0 | 156.0 | 0.0 | 75.0 | 47.0 | 90.0 | 18.0 | 0.467386 | 0.0 | 0.0 |
| 2 | 37.0 | 1.0 | 2.0 | 130.0 | 294.0 | 156.0 | 0.0 | 98.0 | 53.0 | 102.0 | 23.0 | 0.467386 | 0.0 | 0.0 |
| 3 | 48.0 | 0.0 | 4.0 | 138.0 | 214.0 | 156.0 | 1.0 | 72.0 | 51.0 | 118.0 | 18.0 | 0.467386 | 0.0 | 0.0 |
| 4 | 54.0 | 1.0 | 3.0 | 150.0 | 195.0 | 156.0 | 0.0 | 108.0 | 90.0 | 83.0 | 21.0 | 0.467386 | 0.0 | 0.0 |
InĀ [7]:
plt.figure(figsize=(15, 15), layout="constrained", frameon=True)
i=1
for col in numeric_data.columns:
plt.subplot(4, 4, i)
sns.boxplot(df[col], color="#e63946")
plt.title(col)
i += 1
plt.show()
- As from the above Graph we see the potential Features for Outliers can be :
- cholesterol
- plasma glucose
- insulin
- bmi
- diabetes_pedigree
InĀ [8]:
outlier_features=df[['cholesterol', 'plasma glucose', 'insulin', 'bmi', 'diabetes_pedigree']]
InĀ [9]:
plt.figure(figsize=(15, 8), layout="constrained", frameon=True)
i = 1
for col in outlier_features:
plt.subplot(2, 3, i)
sns.histplot(df[col], kde=True, color="#2a9d8f")
plt.title(col)
i += 1
plt.show()
Relationship between important variables¶
InĀ [10]:
plt.figure(figsize=(15, 6))
sns.lineplot(
x=df["hypertension"],
y=df["max heart rate"],
hue=df["triage"],
errorbar=None,
hue_order=["yellow", "orange", "green", "red"],
)
plt.title("Max Heart Rate vs. Hypertension")
plt.show()
InĀ [11]:
plt.figure(figsize=(15, 6))
sns.barplot(
x=df["exercise angina"],
y=df["max heart rate"],
hue=df["triage"],
hue_order=["yellow", "orange", "green", "red"],
)
plt.title("Max Heart Rate vs. Exercise Angina")
plt.show()
InĀ [12]:
plt.figure(figsize=(15, 6))
sns.violinplot(
x=df["heart_disease"], y=df["age"], palette="coolwarm", hue=df["heart_disease"]
)
plt.title("Age vs. Heart Disease")
plt.show()
InĀ [13]:
plt.figure(figsize=(15, 8))
sns.lineplot(y=df["bmi"], x=df["age"], hue=df["smoking_status"])
plt.show()
InĀ [14]:
plt.figure(figsize=(15, 8))
sns.barplot(x=df["chest pain type"], y=df["age"], hue=df["smoking_status"])
plt.title("Chest Pain Type vs. Age on Smoking Status")
plt.show()
InĀ [15]:
plt.figure(figsize=(20,12))
plt.subplot(2,2,1)
sns.histplot(df["age"], kde=True, color="#cdb4db")
plt.title("Age Distribution")
plt.subplot(2, 2, 2)
sns.histplot(df["cholesterol"], kde=True, color="#219ebc")
plt.title("Cholesterol Distribution")
plt.subplot(2, 2, 3)
sns.histplot(df["max heart rate"], kde=True, color="#9b5de5")
plt.title("Max Heart Rate Distribution")
plt.subplot(2, 2, 4)
sns.histplot(df["blood pressure"], kde=True, color="#3a5a40")
plt.title("Blood Pressure Distribution")
plt.show()
Data Preprocessing¶
Imputation¶
InĀ [16]:
# Handling Missing values
missing_values_data=df.isna().sum()[df.isna().sum()>0]
sns.heatmap(df.isnull(), cbar=False, cmap="viridis", yticklabels=False)
plt.title("Heatmap of Missing Values")
plt.show()
InĀ [17]:
#Calculating the percentage of missing values
missing_percentage = round((missing_values_data / len(df)) * 100, 2)
missing_data_summary = pd.DataFrame(
{
"Missing Values": missing_values_data[missing_values_data > 0],
"Percentage (%)": missing_percentage[missing_values_data > 0],
}
).sort_values(by="Percentage (%)", ascending=False)
print(missing_data_summary)
Missing Values Percentage (%) triage 410 5.89 gender 1 0.01
InĀ [18]:
# Handling triage
# df['triage'].value_counts()
InĀ [19]:
# Null values filled with mode
# df["triage"] = df["triage"].fillna("yellow")
InĀ [20]:
df["gender"].value_counts()
Out[20]:
gender 1.0 3703 0.0 3258 Name: count, dtype: int64
InĀ [21]:
df["gender"] = df["gender"].fillna(0.0)
InĀ [22]:
df.isnull().sum()
Out[22]:
age 0 gender 0 chest pain type 0 blood pressure 0 cholesterol 0 max heart rate 0 exercise angina 0 plasma glucose 0 skin_thickness 0 insulin 0 bmi 0 diabetes_pedigree 0 hypertension 0 heart_disease 0 Residence_type 0 smoking_status 0 triage 410 dtype: int64
Outlier Treatment¶
InĀ [23]:
plt.figure(figsize=(15, 8), layout="constrained", frameon=True)
i = 1
for col in outlier_features:
plt.subplot(2, 3, i)
sns.histplot(df[col], kde=True, color="#2a9d8f")
plt.title(col)
i += 1
plt.show()
Transform Data to Reduce Impact
- Log Transformation (Best for right-skewed data)
InĀ [24]:
outlier_features.columns
Out[24]:
Index(['cholesterol', 'plasma glucose', 'insulin', 'bmi', 'diabetes_pedigree'], dtype='object')
InĀ [25]:
# cholesterol
df["cholesterol"] = np.log1p(df["cholesterol"])
InĀ [26]:
# plasma glucose
df["plasma glucose"] = np.log1p(df["plasma glucose"])
InĀ [27]:
# insulin
df["insulin"] = np.log1p(df["insulin"])
InĀ [28]:
# bmi
df["bmi"] = np.log1p(df["bmi"])
InĀ [29]:
# diabetes_pedigree
df["diabetes_pedigree"] = np.log1p(df["diabetes_pedigree"])
InĀ [30]:
plt.figure(figsize=(15, 8), layout="constrained", frameon=True)
i = 1
for col in outlier_features:
plt.subplot(2, 3, i)
sns.histplot(df[col], kde=True, color="#2a9d8f")
plt.title(col)
i += 1
plt.show()
Encoding all the categorical attributes¶
InĀ [31]:
categorical_data=df.select_dtypes("object")
categorical_data.head(5)
Out[31]:
| Residence_type | smoking_status | triage | |
|---|---|---|---|
| 0 | Urban | never smoked | yellow |
| 1 | Urban | never smoked | orange |
| 2 | Urban | never smoked | yellow |
| 3 | Urban | never smoked | orange |
| 4 | Urban | never smoked | yellow |
InĀ [32]:
# Encoding residence_type
df['Residence_type'].value_counts().index
Out[32]:
Index(['Urban', 'Rural'], dtype='object', name='Residence_type')
InĀ [33]:
Residence_type_map = {"Urban": 0, "Rural": 1}
df['Residence_type']=df['Residence_type'].map(Residence_type_map)
df['Residence_type']
Out[33]:
0 0
1 0
2 0
3 0
4 0
..
5105 0
5106 0
5107 1
5108 1
5109 0
Name: Residence_type, Length: 6962, dtype: int64
InĀ [34]:
# Encoding smoking_status
df["smoking_status"].value_counts().index
Out[34]:
Index(['never smoked', 'Unknown', 'formerly smoked', 'smokes'], dtype='object', name='smoking_status')
InĀ [35]:
smoking_status = {"never smoked":0, "Unknown":2, "formerly smoked":0.5, "smokes":1}
df["smoking_status"] = df["smoking_status"].map(smoking_status)
df["smoking_status"]
Out[35]:
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
...
5105 0.0
5106 0.0
5107 0.0
5108 0.5
5109 2.0
Name: smoking_status, Length: 6962, dtype: float64
InĀ [36]:
# triage
df["triage"].value_counts().index
Out[36]:
Index(['yellow', 'green', 'orange', 'red'], dtype='object', name='triage')
InĀ [37]:
# triage_map = {"yellow": 0, "orange": 1, "green": 2, "red": 3}
# df['triage'] = df['triage'].map(triage_map)
# df['triage']
Standardization¶
InĀ [88]:
X = df.drop("triage", axis=1)
y = df["triage"]
InĀ [89]:
from sklearn.preprocessing import StandardScaler
scaler= StandardScaler()
X[X.columns] = scaler.fit_transform(X)
X
Out[89]:
| age | gender | chest pain type | blood pressure | cholesterol | max heart rate | exercise angina | plasma glucose | skin_thickness | insulin | bmi | diabetes_pedigree | hypertension | heart_disease | Residence_type | smoking_status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.465884 | 0.938135 | 1.173314 | 1.410374 | 3.129483 | 0.549734 | -0.256573 | 0.485469 | -0.603531 | -1.109043 | -1.256305 | 0.033263 | -0.277565 | -0.202792 | -0.751562 | -0.771064 |
| 1 | -0.709841 | -1.065945 | 1.970952 | 2.339167 | -0.086923 | -0.485357 | -0.256573 | -0.870489 | -0.428764 | -1.247255 | -1.463029 | 0.033263 | -0.277565 | -0.202792 | -0.751562 | -0.771064 |
| 2 | -1.717898 | 0.938135 | 1.173314 | 0.945977 | 3.129483 | -0.485357 | -0.256573 | 0.123639 | -0.166614 | -0.459755 | -0.521504 | 0.033263 | -0.277565 | -0.202792 | -0.751562 | -0.771064 |
| 3 | -0.793846 | -1.065945 | 2.768591 | 1.317494 | 1.046546 | -0.485357 | 3.897525 | -1.021924 | -0.253998 | 0.458231 | -1.463029 | 0.033263 | -0.277565 | -0.202792 | -0.751562 | -0.771064 |
| 4 | -0.289817 | 0.938135 | 1.970952 | 1.874771 | 0.437322 | -0.485357 | -0.256573 | 0.485469 | 1.449976 | -1.756125 | -0.872181 | 0.033263 | -0.277565 | -0.202792 | -0.751562 | -0.771064 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5105 | 1.894305 | -1.065945 | -0.421962 | 0.063623 | -1.150619 | 0.161575 | -0.256573 | -0.460738 | -1.127831 | -0.099801 | -1.337727 | 0.033263 | 3.602766 | -0.202792 | -0.751562 | -0.771064 |
| 5106 | 1.978310 | -1.065945 | -0.421962 | 0.620899 | -0.981776 | -0.226584 | -0.256573 | 1.036404 | -1.477364 | -1.317504 | 1.636766 | 0.033263 | -0.277565 | -0.202792 | -0.751562 | -0.771064 |
| 5107 | 1.978310 | -1.065945 | -0.421962 | 0.806658 | 0.092503 | -1.455754 | -0.256573 | -0.494609 | -0.690914 | -0.907201 | 0.587230 | 0.033263 | -0.277565 | -0.202792 | 1.330562 | -0.771064 |
| 5108 | -0.541832 | 0.938135 | -0.421962 | 0.620899 | -0.817154 | -0.097198 | -0.256573 | 2.096238 | -0.996756 | -1.041048 | -0.106963 | 0.033263 | -0.277565 | -0.202792 | 1.330562 | -0.149607 |
| 5109 | -1.129865 | -1.065945 | -0.421962 | 0.713778 | -0.234070 | 0.549734 | -0.256573 | -0.393462 | 0.008152 | 0.185336 | -0.017066 | 0.033263 | -0.277565 | -0.202792 | -0.751562 | 1.714764 |
6962 rows Ć 16 columns
InĀ [90]:
y
Out[90]:
0 yellow
1 orange
2 yellow
3 orange
4 yellow
...
5105 yellow
5106 yellow
5107 yellow
5108 green
5109 yellow
Name: triage, Length: 6962, dtype: object
InĀ [41]:
df.to_csv('cleaned_patient_dataset.csv', index=False)
Correlation between all the attributes¶
InĀ [91]:
plt.figure(figsize=(15, 7), layout="constrained")
sns.heatmap(data=X.corr(), annot=True, cmap='Blues')
plt.show()
Model Training¶
InĀ [2]:
df=pd.read_csv('cleaned_patient_dataset.csv')
df.head()
Out[2]:
| age | gender | chest pain type | blood pressure | cholesterol | max heart rate | exercise angina | plasma glucose | skin_thickness | insulin | bmi | diabetes_pedigree | hypertension | heart_disease | Residence_type | smoking_status | triage | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40.0 | 1.0 | 2.0 | 140.0 | 5.686975 | 172.0 | 0.0 | 4.691348 | 43.0 | 4.532599 | 2.995732 | 0.383483 | 0.0 | 0.0 | 0 | 0.0 | yellow |
| 1 | 49.0 | 0.0 | 3.0 | 160.0 | 5.198497 | 156.0 | 0.0 | 4.330733 | 47.0 | 4.510860 | 2.944439 | 0.383483 | 0.0 | 0.0 | 0 | 0.0 | orange |
| 2 | 37.0 | 1.0 | 2.0 | 130.0 | 5.686975 | 156.0 | 0.0 | 4.595120 | 53.0 | 4.634729 | 3.178054 | 0.383483 | 0.0 | 0.0 | 0 | 0.0 | yellow |
| 3 | 48.0 | 0.0 | 4.0 | 138.0 | 5.370638 | 156.0 | 1.0 | 4.290459 | 51.0 | 4.779123 | 2.944439 | 0.383483 | 0.0 | 0.0 | 0 | 0.0 | orange |
| 4 | 54.0 | 1.0 | 3.0 | 150.0 | 5.278115 | 156.0 | 0.0 | 4.691348 | 90.0 | 4.430817 | 3.091042 | 0.383483 | 0.0 | 0.0 | 0 | 0.0 | yellow |
InĀ [3]:
X=df.drop('triage', axis=1)
X
Out[3]:
| age | gender | chest pain type | blood pressure | cholesterol | max heart rate | exercise angina | plasma glucose | skin_thickness | insulin | bmi | diabetes_pedigree | hypertension | heart_disease | Residence_type | smoking_status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40.0 | 1.0 | 2.0 | 140.0 | 5.686975 | 172.0 | 0.0 | 4.691348 | 43.0 | 4.532599 | 2.995732 | 0.383483 | 0.0 | 0.0 | 0 | 0.0 |
| 1 | 49.0 | 0.0 | 3.0 | 160.0 | 5.198497 | 156.0 | 0.0 | 4.330733 | 47.0 | 4.510860 | 2.944439 | 0.383483 | 0.0 | 0.0 | 0 | 0.0 |
| 2 | 37.0 | 1.0 | 2.0 | 130.0 | 5.686975 | 156.0 | 0.0 | 4.595120 | 53.0 | 4.634729 | 3.178054 | 0.383483 | 0.0 | 0.0 | 0 | 0.0 |
| 3 | 48.0 | 0.0 | 4.0 | 138.0 | 5.370638 | 156.0 | 1.0 | 4.290459 | 51.0 | 4.779123 | 2.944439 | 0.383483 | 0.0 | 0.0 | 0 | 0.0 |
| 4 | 54.0 | 1.0 | 3.0 | 150.0 | 5.278115 | 156.0 | 0.0 | 4.691348 | 90.0 | 4.430817 | 3.091042 | 0.383483 | 0.0 | 0.0 | 0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6957 | 80.0 | 0.0 | 0.0 | 111.0 | 5.036953 | 166.0 | 0.0 | 4.439706 | 31.0 | 4.691348 | 2.975530 | 0.383483 | 1.0 | 0.0 | 0 | 0.0 |
| 6958 | 81.0 | 0.0 | 0.0 | 123.0 | 5.062595 | 160.0 | 0.0 | 4.837868 | 23.0 | 4.499810 | 3.713572 | 0.383483 | 0.0 | 0.0 | 0 | 0.0 |
| 6959 | 81.0 | 0.0 | 0.0 | 127.0 | 5.225747 | 141.0 | 0.0 | 4.430698 | 41.0 | 4.564348 | 3.453157 | 0.383483 | 0.0 | 0.0 | 1 | 0.0 |
| 6960 | 51.0 | 1.0 | 0.0 | 123.0 | 5.087596 | 162.0 | 0.0 | 5.119729 | 34.0 | 4.543295 | 3.280911 | 0.383483 | 0.0 | 0.0 | 1 | 0.5 |
| 6961 | 44.0 | 0.0 | 0.0 | 125.0 | 5.176150 | 172.0 | 0.0 | 4.457598 | 57.0 | 4.736198 | 3.303217 | 0.383483 | 0.0 | 0.0 | 0 | 2.0 |
6962 rows Ć 16 columns
InĀ [4]:
y=df['triage']
y
Out[4]:
0 yellow
1 orange
2 yellow
3 orange
4 yellow
...
6957 yellow
6958 yellow
6959 yellow
6960 green
6961 yellow
Name: triage, Length: 6962, dtype: object
K-Means Clustering¶
InĀ [10]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
inertia = []
silhouette_scores = []
K_range = range(2, 11)
for k in K_range:
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(X)
inertia.append(kmeans.inertia_)
silhouette_scores.append(silhouette_score(X, kmeans.labels_))
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(K_range, inertia, marker="o")
plt.xlabel("Number of Clusters")
plt.ylabel("Inertia")
plt.title("Elbow Method")
plt.subplot(1, 2, 2)
plt.plot(K_range, silhouette_scores, marker="o", color="r")
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Score")
plt.title("Silhouette Score Analysis")
plt.show()
Observations from the Elbow Method¶
- The inertia drops sharply from k=2 to k=4, but after that, the decrease slows down.
- The "elbow" (point where the curve starts to flatten) seems to be around k=4 or k=5.
Observations from the Silhouette Score (Right Plot)¶
- Highest silhouette score at k=2 (0.28), meaning two well-separated clusters exist.
- The score gradually decreases as k increases, indicating overlapping clusters.
- After k=6, the silhouette score becomes very low (~0.21), suggesting poor separation.
InĀ [14]:
optimal_k=4
InĀ [12]:
kmeans = KMeans(n_clusters=optimal_k, init="k-means++",random_state=42)
kmeans.fit(X, y)
Out[12]:
KMeans(n_clusters=4, random_state=42)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
KMeans(n_clusters=4, random_state=42)
InĀ [35]:
# 2D Visualization using TSNE
from sklearn.manifold import TSNE
tsne = TSNE(2 , perplexity=200, n_iter=300)
components_tsne = tsne.fit_transform(X)
InĀ [36]:
Kmeans_data = np.vstack((components_tsne.T, kmeans.labels_)).T
Kmeans_data
Out[36]:
array([[-0.60867536, 4.61441231, 0. ],
[ 1.26883852, 7.63253117, 0. ],
[ 2.23190188, 6.16152525, 0. ],
...,
[-0.61595494, 6.51414299, 0. ],
[-2.77767849, 5.40629578, 0. ],
[ 1.26801634, 1.6698674 , 0. ]])
InĀ [37]:
Kmeans_tsne = pd.DataFrame(Kmeans_data, columns=["X1", "X2", "clusters"])
Kmeans_tsne.head(10)
Out[37]:
| X1 | X2 | clusters | |
|---|---|---|---|
| 0 | -0.608675 | 4.614412 | 0.0 |
| 1 | 1.268839 | 7.632531 | 0.0 |
| 2 | 2.231902 | 6.161525 | 0.0 |
| 3 | 1.539832 | 5.844163 | 0.0 |
| 4 | 7.277162 | 2.938786 | 2.0 |
| 5 | -0.075848 | 1.356522 | 0.0 |
| 6 | -1.268180 | 4.740068 | 0.0 |
| 7 | 7.494108 | -1.875750 | 1.0 |
| 8 | 4.927003 | 5.426473 | 2.0 |
| 9 | -3.845872 | 6.579289 | 0.0 |
InĀ [38]:
plt.figure(figsize=(15, 8))
sns.scatterplot(x="X1", y="X2", hue="clusters", data=Kmeans_tsne, palette="tab10")
plt.title("KMeans++ Clustering")
plt.show()
Gaussian Mixture Model¶
InĀ [18]:
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=optimal_k, covariance_type="full")
gmm_model=gmm.fit(X)
InĀ [19]:
GMM_labels = gmm_model.predict(X)
GMM_labels
Out[19]:
array([0, 0, 0, ..., 1, 1, 1], dtype=int64)
InĀ [20]:
from sklearn.metrics import silhouette_score
silhouette_score(X, GMM_labels)
Out[20]:
-0.043183584125432585
InĀ [21]:
GMM_data = np.vstack((components_tsne.T, GMM_labels)).T
GMM_data
Out[21]:
array([[-0.60867536, 4.61441231, 0. ],
[ 1.26883852, 7.63253117, 0. ],
[ 2.23190188, 6.16152525, 0. ],
...,
[-0.61595494, 6.51414299, 1. ],
[-2.77767849, 5.40629578, 1. ],
[ 1.26801634, 1.6698674 , 1. ]])
InĀ [22]:
GMM_tsne = pd.DataFrame(GMM_data, columns=["X1", "X2", "clusters"])
GMM_tsne.head(10)
Out[22]:
| X1 | X2 | clusters | |
|---|---|---|---|
| 0 | -0.608675 | 4.614412 | 0.0 |
| 1 | 1.268839 | 7.632531 | 0.0 |
| 2 | 2.231902 | 6.161525 | 0.0 |
| 3 | 1.539832 | 5.844163 | 3.0 |
| 4 | 7.277162 | 2.938786 | 0.0 |
| 5 | -0.075848 | 1.356522 | 0.0 |
| 6 | -1.268180 | 4.740068 | 0.0 |
| 7 | 7.494108 | -1.875750 | 0.0 |
| 8 | 4.927003 | 5.426473 | 3.0 |
| 9 | -3.845872 | 6.579289 | 0.0 |
InĀ [23]:
plt.figure(figsize=(15, 8))
sns.scatterplot(x="X1", y="X2", hue="clusters", data=GMM_tsne, palette="tab10")
plt.title('GMM Cluters')
plt.show()
Hierarchical Clustering¶
InĀ [57]:
from scipy.cluster import hierarchy
Z=hierarchy.linkage(X, method='ward')
InĀ [58]:
Z.shape, Z
Out[58]:
((6961, 4),
array([[6.22000000e+02, 1.05000000e+03, 3.65905944e-01, 2.00000000e+00],
[1.36400000e+03, 1.36500000e+03, 3.89680688e-01, 2.00000000e+00],
[6.41000000e+02, 8.96000000e+02, 4.38727642e-01, 2.00000000e+00],
...,
[1.38980000e+04, 1.39090000e+04, 1.09549943e+02, 7.09000000e+02],
[1.39190000e+04, 1.39200000e+04, 1.23591245e+02, 5.83200000e+03],
[1.39180000e+04, 1.39210000e+04, 1.85486960e+02, 6.96200000e+03]]))
InĀ [59]:
plt.figure(figsize=(12, 10))
hierarchy.dendrogram(Z)
plt.title('Dendogram of CLusters')
plt.ylabel('Euclidean Distance')
plt.show()
InĀ [24]:
optimal_k=4
InĀ [25]:
from sklearn.cluster import AgglomerativeClustering
agg_cluster = AgglomerativeClustering(
n_clusters=optimal_k, metric="euclidean", linkage="ward"
)
agg_labels = agg_cluster.fit_predict(X)
InĀ [26]:
print(f"Silhouette Score: {silhouette_score(X, agg_labels)}")
Silhouette Score: 0.19928877106753798
InĀ [27]:
np.unique(agg_labels), agg_labels
Out[27]:
(array([0, 1, 2, 3], dtype=int64), array([1, 1, 1, ..., 1, 1, 3], dtype=int64))
InĀ [28]:
Hierarchy_data = np.vstack((components_tsne.T, agg_labels)).T
Hierarchy_data
Out[28]:
array([[-0.60867536, 4.61441231, 1. ],
[ 1.26883852, 7.63253117, 1. ],
[ 2.23190188, 6.16152525, 1. ],
...,
[-0.61595494, 6.51414299, 1. ],
[-2.77767849, 5.40629578, 1. ],
[ 1.26801634, 1.6698674 , 3. ]])
InĀ [29]:
hierarchy_tsne = pd.DataFrame(Hierarchy_data, columns=["X1", "X2", "clusters"])
hierarchy_tsne.head(10)
Out[29]:
| X1 | X2 | clusters | |
|---|---|---|---|
| 0 | -0.608675 | 4.614412 | 1.0 |
| 1 | 1.268839 | 7.632531 | 1.0 |
| 2 | 2.231902 | 6.161525 | 1.0 |
| 3 | 1.539832 | 5.844163 | 1.0 |
| 4 | 7.277162 | 2.938786 | 3.0 |
| 5 | -0.075848 | 1.356522 | 1.0 |
| 6 | -1.268180 | 4.740068 | 1.0 |
| 7 | 7.494108 | -1.875750 | 3.0 |
| 8 | 4.927003 | 5.426473 | 1.0 |
| 9 | -3.845872 | 6.579289 | 1.0 |
InĀ [30]:
plt.figure(figsize=(15, 8))
sns.scatterplot(x="X1", y="X2", hue="clusters", data=hierarchy_tsne, palette="tab10")
plt.title('Hierarchical Clusters')
plt.show()
Compare the clustering results of all the algorithms using Inertia and the Silhouette Score.¶
InĀ [31]:
print(f"Hierarchal Clustering Silhouette Score: {silhouette_score(X, hierarchy_tsne['clusters'])}")
Hierarchal Clustering Silhouette Score: 0.19928877106753798
InĀ [32]:
print(f"GMM Clutering Silhouette Score: {silhouette_score(X, GMM_tsne['clusters'])}")
GMM Clutering Silhouette Score: -0.043183584125432585
InĀ [33]:
print(f"KMeans++ Silhouette Score: {silhouette_score(X, Kmeans_tsne['clusters'])}")
KMeans++ Silhouette Score: 0.2509862709654033
The Silhouette Score of all the Models Turns out to be Good
- The score is near to 1 which is a good sign
The Best Result is given by Gausian Mixture Model
Visualize the clusters formed using T-SNE for all the three algorithms.¶
InĀ [34]:
plt.figure(figsize=(20, 8))
plt.subplot(1,3,1)
sns.scatterplot(x="X1", y="X2", hue="clusters", data=hierarchy_tsne, palette="tab10")
plt.title("Hierarchical Clusters")
plt.subplot(1, 3, 2)
sns.scatterplot(x="X1", y="X2", hue="clusters", data=GMM_tsne, palette="tab10")
plt.title("GMM Clusters")
plt.subplot(1, 3, 3)
sns.scatterplot(x="X1", y="X2", hue="clusters", data=Kmeans_tsne, palette="tab10")
plt.title("KMeans++ Clusters")
plt.show()
Expected Insights¶
Identification of distinct patient groups based on health and lifestyle attributes.¶
InĀ [39]:
Orignal_data = np.vstack((components_tsne.T, y)).T
Orignal_tsne = pd.DataFrame(Orignal_data, columns=["X1", "X2", "clusters"])
InĀ [40]:
plt.figure(figsize=(20, 15))
plt.subplot(2, 1, 1)
sns.scatterplot(x="X1", y="X2", hue="clusters", data=Orignal_tsne, palette="tab10")
plt.title("Orignal Clusters")
plt.subplot(2, 1, 2)
sns.scatterplot(x="X1", y="X2", hue="clusters", data=GMM_tsne, palette="tab10")
plt.title("GMM Clusters")
plt.show()
Comparison of clustering algorithms to determine which provides the most meaningful segmentation.¶
InĀ [41]:
plt.figure(figsize=(20, 15))
plt.subplot(2, 2, 1)
sns.scatterplot(x="X1", y="X2", hue="clusters", data=Orignal_tsne, palette="tab10")
plt.title("Orignal Clusters")
plt.subplot(2, 2, 2)
sns.scatterplot(x="X1", y="X2", hue="clusters", data=hierarchy_tsne, palette="tab10")
plt.title("Hierarchical Clusters")
plt.subplot(2, 2, 3)
sns.scatterplot(x="X1", y="X2", hue="clusters", data=GMM_tsne, palette="tab10")
plt.title("GMM Clusters")
plt.subplot(2, 2, 4)
sns.scatterplot(x="X1", y="X2", hue="clusters", data=Kmeans_tsne, palette="tab10")
plt.title("KMeans++ Clusters")
plt.show()
\
/
\
/
\
/
\
/
\
/
\
/
Apply Clustering Models¶
InĀ [42]:
# K-Means Clustering
optimal_k = 4 # Choose based on Elbow/Silhouette
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
X["KMeans_Cluster"] = kmeans.fit_predict(X)
InĀ [44]:
# Gaussian Mixture Model (GMM)
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=optimal_k, random_state=42)
X["GMM_Cluster"] = gmm.fit_predict(X)
InĀ [45]:
# Hierarchical Clustering
from sklearn.cluster import AgglomerativeClustering
agg_clustering = AgglomerativeClustering(n_clusters=optimal_k, linkage="ward")
X["Hierarchical_Cluster"] = agg_clustering.fit_predict(X)
Analyze Cluster Insights¶
InĀ [47]:
X.groupby("GMM_Cluster").mean()
Out[47]:
| age | gender | chest pain type | blood pressure | cholesterol | max heart rate | exercise angina | plasma glucose | skin_thickness | insulin | bmi | diabetes_pedigree | hypertension | heart_disease | Residence_type | smoking_status | KMeans_Cluster | Hierarchical_Cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GMM_Cluster | ||||||||||||||||||
| 0 | 57.949393 | 0.497470 | 0.838057 | 124.438765 | 5.243208 | 161.608300 | 0.096660 | 4.548247 | 39.415992 | 4.700638 | 3.292613 | 0.383005 | 0.070850 | 0.043016 | 0.352733 | 0.630314 | 3.0 | 0.338057 |
| 1 | 54.518430 | 0.676576 | 0.045779 | 84.481570 | 5.165301 | 165.567776 | 0.004162 | 4.623849 | 35.964328 | 4.724797 | 3.331616 | 0.374696 | 0.062426 | 0.027943 | 0.265755 | 0.460166 | 1.0 | 1.914982 |
| 2 | 57.939480 | 0.516403 | 0.983597 | 128.473416 | 5.252657 | 162.115385 | 0.119910 | 4.542947 | 78.730204 | 4.695637 | 3.262908 | 0.383483 | 0.066742 | 0.033371 | 0.359163 | 0.576640 | 0.0 | 2.336538 |
| 3 | 59.454427 | 0.435547 | 0.137370 | 96.428385 | 5.174824 | 165.274089 | 0.013021 | 4.534976 | 76.798177 | 4.708984 | 3.351350 | 0.384269 | 0.087891 | 0.054688 | 0.477865 | 0.833333 | 2.0 | 1.274089 |
Visualize Clustering Results¶
InĀ [49]:
from sklearn.decomposition import PCA
import seaborn as sns
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
plt.figure(figsize=(20, 20))
plt.subplot(2,2,1)
sns.scatterplot(
x=X_pca[:, 0], y=X_pca[:, 1], hue=X["KMeans_Cluster"], palette="viridis"
)
plt.title("KMeans")
plt.subplot(2, 2, 2)
sns.scatterplot(
x=X_pca[:, 0], y=X_pca[:, 1], hue=X["GMM_Cluster"], palette="viridis"
)
plt.title("GMM")
plt.subplot(2, 2, 3)
sns.scatterplot(
x=X_pca[:, 0], y=X_pca[:, 1], hue=X["Hierarchical_Cluster"], palette="viridis"
)
plt.title("Hierarchical_Clusters")
plt.subplot(2, 2, 4)
sns.scatterplot(
x=X_pca[:, 0], y=X_pca[:, 1], hue=y, palette="viridis"
)
plt.title("orignal data")
plt.show()
InĀ [52]:
pd.crosstab(df['triage'], X["KMeans_Cluster"])
Out[52]:
| KMeans_Cluster | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| triage | ||||
| green | 75 | 169 | 87 | 109 |
| orange | 172 | 1 | 8 | 165 |
| red | 24 | 53 | 14 | 38 |
| yellow | 1470 | 1277 | 1289 | 1601 |